Abstract

We consider a combination of heavily trimmed sums and sample quantiles which
arises when examining properties of clustering criteria, and prove limit theorems for it. The
object of interest, which we call the Empirical Cross-over Function, is an L-statistic
whose weights do not comply with the regularity conditions required by existing
limit results. A law of large numbers, a CLT and a functional CLT are proven.
Suppose $W_1, W_2, \ldots, W_n$ for $n > 1$ are i.i.d. random variables with distribution function $F$.
If $W_{(1)} \le W_{(2)} \le \cdots \le W_{(n)}$ are the order statistics, then we define, for $0 < p < 1$, the
Empirical Cross-over Function (ECF)

$$G_n(p) = \frac{1}{k}\sum_{j=1}^{k} W_{(j)} - W_{(k)} + \frac{1}{n-k}\sum_{j=k+1}^{n} W_{(j)} - W_{(k+1)} \quad \text{for } \frac{k-1}{n} < p \le \frac{k}{n}, \tag{1}$$

with $1 \le k \le n-1$. The function $G_n$ is a special case of linear functions of the order statistics $W_{(i)}$, $1 \le i \le n$,
popularly referred to as L-statistics. L-statistics are usually represented as

$$L_n = \sum_{i=1}^{n} a_{i,n} W_{(i)}, \tag{2}$$

where $a_{i,n}$ is a triangular array of constants, referred to as weights. A wide variety of limiting
results on L-statistics have been derived over the years. We direct the interested reader to
[1] for a good source of results and relevant references. The asymptotic properties of these
objects have been determined under regularity conditions which, although usually not too
stringent, can nevertheless be restrictive in practice. In this paper, we examine one
such occasion, wherein we are faced with an L-statistic, the ECF, whose weights are not
sufficiently smooth. As a consequence, asymptotic normality and a functional limit theorem
do not follow readily.

Hartigan, in his elegant paper [6], derived asymptotic distributions of clustering criteria.
He employed what he referred to as the split function in deriving the limiting results. The
ECF, $G_n$, arises in a natural manner as the empirical counterpart of a certain functional of
his split function when we are concerned with random variables having a common invertible
distribution function. The properties of the k-means clustering procedure for the univariate
and the multivariate cases have been investigated extensively. Pollard [9], [10] proved strong
consistency and asymptotic normality results in the univariate case. Serinko et al. [12]
proved some weak limit theorems under non-regular conditions for the univariate case. With
the intention of having a more robust procedure for clustering, García-Escudero et al. [5],
[3] proposed trimmed k-means clustering and provided a central limit theorem for the
multivariate case. The ECF is an interesting probabilistic object in its own right and can,
in principle, be extended to dependent sequences or to the multivariate setting. In this
paper, we prove consistency, a central limit theorem and also an invariance principle for $G_n$.



2 Empirical Cross-over Function 

In this section, we introduce the necessary constructs from clustering techniques from which
we develop the ECF. Let $W_1, W_2, \ldots, W_n$ be continuous i.i.d. random variables with cumulative
distribution function $F$. We make the following assumptions.

A1. $F$ is invertible for $0 < p < 1$ and absolutely continuous with density $f$.

A2. $E(W_1^2) < \infty$.

A3. For $0 < p < 1$, $F$ is twice differentiable at $F^{-1}(p)$.

For $0 < p < 1$, consider the split function of $F^{-1}$ at $p$, as defined in [6],



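$$B(F,p) = p\,\mu_l^2 + (1-p)\,\mu_u^2 - \left(\int_0^1 F^{-1}(q)\,dq\right)^2,$$

where

$$\mu_l = \frac{1}{p}\int_{q<p} F^{-1}(q)\,dq = \frac{1}{p}\int_{-\infty}^{F^{-1}(p)} w\,dF,$$

$$\mu_u = \frac{1}{1-p}\int_{q>p} F^{-1}(q)\,dq = \frac{1}{1-p}\int_{F^{-1}(p)}^{\infty} w\,dF.$$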

One way to think of $B(F,p)$ is as the 'between cluster sum of squares' in the case where
we are concerned with two clusters in one dimension. Therefore, the value of $p \in (0,1)$
maximizing this function determines the location at which the data are split into two
clusters. We denote that value by $p_0$; it is referred to as the split point in [6]. As
pointed out in [6], the conditions that guarantee the existence and uniqueness of the split
point are unclear. Determining the requisite conditions is, by itself, worthy of further
investigation. However, for the purposes of this paper, those conditions and the split point
itself are not important. When $F$ is invertible, it is known that the split point $p_0$ solves

$$(\mu_u - \mu_l)\left[\mu_u + \mu_l - 2F^{-1}(p)\right] = 0, \tag{3}$$

where the left side is the derivative of $B(F,p)$. Owing to the fact that $(\mu_u - \mu_l) > 0$ for all
$0 < p < 1$, we are interested only in the zero of

$$G(p) = \mu_l + \mu_u - 2F^{-1}(p), \tag{4}$$

which we refer to as the cross-over function. The empirical version of the cross-over function
represents the primary object of this paper. At this juncture, for better exposition, we recall
the definition of the ECF; for $0 < p < 1$, we have

$$G_n(p) = \frac{1}{k}\sum_{j=1}^{k} W_{(j)} - W_{(k)} + \frac{1}{n-k}\sum_{j=k+1}^{n} W_{(j)} - W_{(k+1)} \quad \text{for } \frac{k-1}{n} < p \le \frac{k}{n}. \tag{5}$$

Remark 1. Intuition about the ECF is useful here. The term 'cross-over' arises owing to
the observation that

$$G_n\!\left(\tfrac{1}{n}\right) = W_{(1)} - W_{(1)} + \frac{1}{n-1}\sum_{j=2}^{n} W_{(j)} - W_{(2)} > 0,$$

and the function crosses over at some $1 \le k \le n-1$. If $k^*$ is the index at which $G_n$ crosses
over, then $W_{(k^*)}$ represents the datum at which the data are split, leading to the formation of
two clusters. The term $\frac{1}{k}\sum_{j=1}^{k} W_{(j)} - W_{(k)}$ can be thought of as a 'distance' between the
mean of the first $k$ observations, arranged in increasing order, and their maximum value;
the term $\frac{1}{n-k}\sum_{j=k+1}^{n} W_{(j)} - W_{(k+1)}$ represents the 'distance' between the mean of the last
$n-k$ observations and their minimum.
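The definition of the ECF and the cross-over index of Remark 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `ecf_values` and `crossover_index` are hypothetical, and the rule used to pick $k^*$ (the first $k$ at which $G_n$ is positive while its value at $k+1$ is not) is an assumption, since the text does not fix a tie-breaking convention.

```python
def ecf_values(sample):
    """Return the ECF evaluated at k = 1, ..., n-1 (entry k-1 corresponds to k).

    Each value is mean(W_(1..k)) - W_(k) + mean(W_(k+1..n)) - W_(k+1).
    """
    w = sorted(sample)
    n = len(w)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(w)
    values = []
    lower_sum = 0.0
    for k in range(1, n):              # k = 1, ..., n-1
        lower_sum += w[k - 1]          # running sum of W_(1), ..., W_(k)
        upper_sum = total - lower_sum  # sum of W_(k+1), ..., W_(n)
        values.append(lower_sum / k - w[k - 1] + upper_sum / (n - k) - w[k])
    return values


def crossover_index(sample):
    """First k with G_n > 0 at k and G_n <= 0 at k+1 (assumed convention), else None."""
    values = ecf_values(sample)
    for k in range(1, len(values)):
        if values[k - 1] > 0 >= values[k]:
            return k
    return None


data = [0.1, 0.3, 0.2, 5.0, 5.2, 4.9]   # two well-separated groups of three points
print(ecf_values(data))
print(crossover_index(data))
```

For two well-separated groups such as the toy data above, the sign change of the ECF sits at the gap between the groups, which matches the interpretation of $W_{(k^*)}$ as the datum at which the data are split.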

Remark 2. The function $G_n$ is a linear combination of the order statistics $W_{(i)}$, $1 \le i \le n$, and
hence an L-statistic. In the representation of an L-statistic $L_n$ shown in (2), if the weights
$a_{i,n}$, $1 \le i \le n$, are of the form $\frac{1}{n} J\!\left(\frac{i}{n+1}\right)$, where $J(u)$, $0 < u < 1$, is the weight function,
then it is possible to obtain an equivalent representation as

$$L_n = \frac{1}{n}\sum_{i=1}^{n} J\!\left(\frac{i}{n+1}\right) W_{(i)}.$$

The form of the weights $a_{i,n}$ represents the smoothness condition which guarantees asymptotic
normality (see, e.g., [1], page 227 or [2], page 318). Unfortunately, $G_n$ cannot be
represented in this form, since it has 'bad' weights in the following sense: for $0 < p < 1$, we
see that the order statistics $W_{(\lceil np \rceil)}$ and $W_{(\lceil np \rceil + 1)}$ have weights $\frac{1}{\lceil np \rceil} - 1$ and $\frac{1}{n - \lceil np \rceil} - 1$,
respectively. This clearly violates the smoothness condition, rendering the usage of existing
results inappropriate.
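The 'bad' weights can be made concrete by writing $G_n(p)$ explicitly as $\sum_i a_{i,n} W_{(i)}$. The sketch below builds the weight vector with exact rational arithmetic; `ecf_weights` is a hypothetical helper name, and $k = \lceil np \rceil$ is assumed as in the remark above.

```python
from fractions import Fraction
from math import ceil


def ecf_weights(n, p):
    """Weights a_{i,n}, i = 1..n, with G_n(p) = sum_i a_{i,n} W_(i) and k = ceil(n p)."""
    k = ceil(n * p)
    if not 1 <= k <= n - 1:
        raise ValueError("p must give 1 <= ceil(n p) <= n - 1")
    # 1/k on the lower block, 1/(n-k) on the upper block ...
    weights = [Fraction(1, k)] * k + [Fraction(1, n - k)] * (n - k)
    # ... and an extra -1 on W_(k) and on W_(k+1).
    weights[k - 1] -= 1
    weights[k] -= 1
    return weights


w = ecf_weights(10, 0.5)
print([str(a) for a in w])
print(sum(w))
```

All weights are of order $1/n$ except the two at positions $k$ and $k+1$, which stay close to $-1$ however large $n$ is, so $n\,a_{i,n}$ cannot be written as a bounded smooth function $J$ of $i/(n+1)$. The weights also sum to zero, which is the shift invariance of the ECF.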

Remark 3. Observe that for a fixed $0 < p < 1$,

$$\frac{1}{k}\sum_{j=1}^{k} W_{(j)} - W_{(k)} = \frac{1}{\lceil np \rceil}\sum_{j=1}^{\lceil np \rceil} W_{(j)} - W_{(\lceil np \rceil)},$$

$$\frac{1}{n-k}\sum_{j=k+1}^{n} W_{(j)} - W_{(k+1)} = \frac{1}{n - \lceil np \rceil}\sum_{j=\lceil np \rceil + 1}^{n} W_{(j)} - W_{(\lceil np \rceil + 1)},$$

where $\lceil x \rceil$ represents the smallest integer not less than $x$. For a fixed $p \in (0,1)$, the sums
shown above are trimmed sums. More precisely, since $\lceil np \rceil / n \to p$ and $(n - \lceil np \rceil)/n \to 1 - p$, they
represent the case of heavy trimming, asymptotics for which are well known (see, e.g.,
[5] and [IS]). Unfortunately, the two order statistics $W_{(\lceil np \rceil)}$ and $W_{(\lceil np \rceil + 1)}$ represent
a formidable obstacle to the use of existing results on asymptotic normality of heavily
trimmed sums. The function $G_n$ is hence a combination of heavily trimmed
sums and intermediate order statistics, and asymptotic results for such a combination, to
our knowledge, are yet to be developed.



3 Limit theorems for $G_n$



In this section, we prove the main results on the asymptotic behavior of the sample cross-over
function $G_n$.

Theorem 3.1. Under assumptions A1 and A2, as $n \to \infty$,

$$G_n(p) \xrightarrow{P} G(p).$$

Proof. Since it suffices to prove consistency of the individual components of the ECF,
this is a relatively easy exercise. However, for completeness, and in order to
introduce notation and ideas that will be used in the proof of the subsequent theorem, we
provide a detailed proof of the law of large numbers for $G_n$.

For $0 < p < 1$, it is well known that $W_{(\lceil np \rceil)} \xrightarrow{P} F^{-1}(p)$ at points of continuity of $F^{-1}$.
It is also the case that $W_{(\lceil np \rceil + 1)} \xrightarrow{P} F^{-1}(p)$, since the condition needed for the $k_n$-th order
statistic $W_{(k_n)}$ to be consistent for $F^{-1}(p)$ is that $k_n/n \to p$ (see, for instance, [14]). Let us
define

$$r_n = \frac{1}{n}\sum_{i=1}^{n} I_{W_i \le F^{-1}(p)},$$

where $I_A$ is the indicator function of the set $A$. By the strong law of large numbers, $r_n \to p$
w.p.1. Now,



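$$k = \lceil np \rceil.$$

Therefore,

$$\frac{1}{k}\sum_{i=1}^{k} W_{(i)} = \frac{1}{\lceil np \rceil}\left[\sum_{i=1}^{\lceil n r_n \rceil} W_{(i)} + \sum_{i=\lceil n r_n \rceil + 1}^{\lceil np \rceil} W_{(i)}\right].$$

It is clear here that if $\lceil n r_n \rceil + 1 > \lceil np \rceil$, the upper and lower limits of the second sum are
interchanged with a negative sign.
The random sum satisfies

$$\left|\frac{1}{\lceil np \rceil}\sum_{i=\lceil n r_n \rceil + 1}^{\lceil np \rceil} W_{(i)}\right| \le \frac{1}{\lceil np \rceil}\sum_{i=\lceil n r_n \rceil + 1}^{\lceil np \rceil} \left|W_{(i)}\right| \le \frac{1}{\lceil np \rceil}\,\bigl|\lceil np \rceil - \lceil n r_n \rceil\bigr|\,\bigl(|W_{(\lceil np \rceil)}| + |W_{(\lceil n r_n \rceil)}|\bigr).$$

Recall that $r_n = p + O_p(n^{-1/2})$, and hence $|W_{(\lceil np \rceil)}|$ and $|W_{(\lceil n r_n \rceil)}|$ converge in probability to
$|F^{-1}(p)|$ (see [H], page 308) and $|r_n - p| \xrightarrow{P} 0$. Consequently, we have that

$$\frac{1}{\lceil np \rceil}\sum_{i=\lceil n r_n \rceil + 1}^{\lceil np \rceil} W_{(i)} \xrightarrow{P} 0.$$

However,

$$\frac{1}{\lceil np \rceil}\sum_{i=1}^{\lceil n r_n \rceil} W_{(i)} = \frac{n}{\lceil np \rceil}\cdot\frac{1}{n}\sum_{i=1}^{n} W_i I_{W_i \le F^{-1}(p)} \xrightarrow{P} \mu_l$$

by the law of large numbers for i.i.d. random variables. As a consequence, $\frac{1}{k}\sum_{i=1}^{k} W_{(i)} \xrightarrow{P} \mu_l$.

In similar fashion we note that $\frac{1}{n-k}\sum_{i=k+1}^{n} W_{(i)} \xrightarrow{P} \mu_u$ for all $0 < p < 1$ and $\frac{k-1}{n} < p \le \frac{k}{n}$.

Combining the above two convergences with the convergence of $W_{(\lceil np \rceil)}$ and $W_{(\lceil np \rceil + 1)}$ to
their identical limits, we have that $G_n(p) \xrightarrow{P} G(p)$ for each $0 < p < 1$. □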

Remark 4. It is worthwhile to note that the sum trimmed at the random level,

$$\sum_{i=1}^{\lceil n r_n \rceil} W_{(i)},$$

is exactly equal to the truncated sum $\sum_{i=1}^{n} W_i I_{W_i \le F^{-1}(p)}$, which is a sum of i.i.d.
random variables. This subtle relationship is greatly convenient in our proofs.
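The identity in Remark 4 is easy to check numerically. The sketch below draws a standard normal sample (an assumption made only for illustration, with $p = 0.5$ so that $F^{-1}(p) = 0$) and compares the sum of the $n r_n$ smallest order statistics with the truncated sum; the helper name `trimmed_vs_truncated` is hypothetical.

```python
import math
import random


def trimmed_vs_truncated(sample, q):
    """Compare the sum of the n*r_n smallest order statistics with sum of W_i * 1{W_i <= q}."""
    w = sorted(sample)
    count = sum(1 for x in sample if x <= q)              # n * r_n, an integer
    trimmed = math.fsum(w[:count])                         # trimmed at the random level
    truncated = math.fsum(x for x in sample if x <= q)     # truncated sum of i.i.d. terms
    return count, trimmed, truncated


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
count, trimmed, truncated = trimmed_vs_truncated(sample, 0.0)  # q = F^{-1}(0.5) for N(0,1)
print(count, trimmed, truncated)
```

The two sums run over the same set of observations, namely those not exceeding $F^{-1}(p)$, which is why they agree exactly rather than only asymptotically.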

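For ease of notation, let us define, for $0 < p < 1$,

$$\theta_p = \frac{1}{p}\left[W_1 I_{W_1 \le F^{-1}(p)} - F^{-1}(p) I_{W_1 \le F^{-1}(p)}\right] + \frac{1}{1-p}\left[W_1 I_{W_1 > F^{-1}(p)} - F^{-1}(p) I_{W_1 > F^{-1}(p)}\right] - 2\,\frac{p - I_{W_1 \le F^{-1}(p)}}{f(F^{-1}(p))},$$

and $U_n(p) = \sqrt{n}\,(G_n(p) - G(p))$ for $0 < a \le p \le b < 1$.

Theorem 3.2. Under assumptions A1-A3, as $n \to \infty$,

$$\sqrt{n}\,(G_n(p) - G(p)) \xrightarrow{d} N(0, \sigma),$$

where $\sigma = \operatorname{Var}(\theta_p)$. Furthermore,

$$U_n(p) \Rightarrow U(p)$$

in the Skorohod space $D[a,b]$ equipped with the $J_1$ topology, where $U(p)$ is a Gaussian
process with mean $0$ and covariance given by

$$\operatorname{Cov}(U(p), U(q)) = \operatorname{Cov}(\theta_p, \theta_q).$$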

Proof. The trick used in proving the asymptotic normality of $G_n$ is to consider centered
versions of its individual components and, by the use of Bahadur's representation for
sample quantiles, to rewrite $G_n$ as a sum of i.i.d. random variables and an error term which
goes to zero at an appropriate rate. This then paves the way for the usage of standard
results.

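More specifically, first note that for $0 < p < 1$ and each $i = 1, \ldots, n$, $E\left(W_i I_{W_i \le F^{-1}(p)}\right) = p\mu_l$
and $E\left(W_i I_{W_i > F^{-1}(p)}\right) = (1-p)\mu_u$. Observe that, for $\frac{k-1}{n} < p \le \frac{k}{n}$,

$$\frac{1}{p\sqrt{n}}\left[\sum_{i=1}^{k} W_{(i)} - np\mu_l\right] = \frac{1}{p\sqrt{n}}\left[\sum_{i=1}^{\lceil np \rceil} W_{(i)} - np\mu_l\right]$$

$$= \frac{1}{p\sqrt{n}}\left[\sum_{i=1}^{\lceil n r_n \rceil} W_{(i)} - np\mu_l\right] + \frac{1}{p\sqrt{n}}\sum_{i=\lceil n r_n \rceil + 1}^{\lceil np \rceil}\left(W_{(i)} - F^{-1}(p)\right) + \frac{1}{p\sqrt{n}}\,F^{-1}(p)\left(\lceil np \rceil - \lceil n r_n \rceil\right).$$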



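Now note that

$$\left|\frac{1}{p\sqrt{n}}\sum_{i=\lceil n r_n \rceil + 1}^{\lceil np \rceil}\left(W_{(i)} - F^{-1}(p)\right)\right| \le \frac{1}{p\sqrt{n}}\,\bigl|\lceil np \rceil - \lceil n r_n \rceil\bigr|\,\max\left(\left|W_{(\lceil np \rceil)} - F^{-1}(p)\right|, \left|W_{(\lceil n r_n \rceil)} - F^{-1}(p)\right|\right).$$

By the same argument used in the proof of Theorem 3.1, $\left|W_{(\lceil np \rceil)} - F^{-1}(p)\right| \xrightarrow{P} 0$ and
$\left|W_{(\lceil n r_n \rceil)} - F^{-1}(p)\right| \xrightarrow{P} 0$. By the central limit theorem for i.i.d. random variables, $\sqrt{n}\,(p - r_n)$
is asymptotically normal and hence bounded in probability. Consequently,

$$\frac{1}{p\sqrt{n}}\sum_{i=\lceil n r_n \rceil + 1}^{\lceil np \rceil}\left(W_{(i)} - F^{-1}(p)\right) \xrightarrow{P} 0.$$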



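Next, recall that

$$\sum_{i=1}^{\lceil n r_n \rceil} W_{(i)} = \sum_{i=1}^{n} W_i I_{W_i \le F^{-1}(p)}, \qquad \lceil n r_n \rceil = \sum_{i=1}^{n} I_{W_i \le F^{-1}(p)}.$$

Therefore,

$$\frac{1}{p\sqrt{n}}\left[\sum_{i=1}^{k} W_{(i)} - np\mu_l\right] = \frac{1}{p\sqrt{n}}\sum_{i=1}^{n}\left(W_i I_{W_i \le F^{-1}(p)} - p\mu_l\right) + \frac{1}{p\sqrt{n}}\,F^{-1}(p)\sum_{i=1}^{n}\left(p - I_{W_i \le F^{-1}(p)}\right) + o_p(1) = \sqrt{n}\,\bar{\xi} + o_p(1),$$

where

$$\xi_i = \frac{1}{p}\left[W_i I_{W_i \le F^{-1}(p)} - F^{-1}(p) I_{W_i \le F^{-1}(p)} - \left(p\mu_l - pF^{-1}(p)\right)\right]$$

are i.i.d. random variables for $i = 1, \ldots, n$ and $\bar{\xi} = \frac{1}{n}\sum_{i=1}^{n}\xi_i$.
Using a similar argument, we can claim that

$$\frac{1}{(1-p)\sqrt{n}}\left[\sum_{i=k+1}^{n} W_{(i)} - n(1-p)\mu_u\right] = \sqrt{n}\,\bar{\tau} + o_p(1),$$

where

$$\tau_i = \frac{1}{1-p}\left[W_i I_{W_i > F^{-1}(p)} - F^{-1}(p) I_{W_i > F^{-1}(p)} - \left((1-p)\mu_u - (1-p)F^{-1}(p)\right)\right]$$

are i.i.d. random variables and $\bar{\tau} = \frac{1}{n}\sum_{i=1}^{n}\tau_i$. That takes care of the two trimmed sums.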

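Next, we turn our attention to the two quantiles $W_{(k)}$ and $W_{(k+1)}$, or equivalently $W_{(\lceil np \rceil)}$
and $W_{(\lceil np \rceil + 1)}$, for $\frac{k-1}{n} < p \le \frac{k}{n}$. Using the Bahadur representation for sample quantiles
(see [2]), justified by assumptions A1 and A3, we have

$$\sqrt{n}\left(W_{(\lceil np \rceil)} - F^{-1}(p)\right) = \sqrt{n}\,\bar{\kappa} + o_p(1) \quad\text{and}\quad \sqrt{n}\left(W_{(\lceil np \rceil + 1)} - F^{-1}(p)\right) = \sqrt{n}\,\bar{\kappa} + o_p(1),$$

where

$$\kappa_i = \frac{p - I_{W_i \le F^{-1}(p)}}{f(F^{-1}(p))}$$

are i.i.d. random variables and $\bar{\kappa} = \frac{1}{n}\sum_{i=1}^{n}\kappa_i$.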
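The Bahadur step can be illustrated numerically for a single sample. The sketch below assumes standard normal data (so that $F^{-1}$ and $f$ are available from `statistics.NormalDist`) and compares $\sqrt{n}\,(W_{(\lceil np \rceil)} - F^{-1}(p))$ with $\sqrt{n}\,\bar{\kappa}$; the function name `bahadur_gap` and the choice of distribution are assumptions made for illustration only.

```python
import math
import random
from statistics import NormalDist


def bahadur_gap(n, p=0.5, seed=1):
    """Return (sqrt(n)*(W_(ceil(np)) - q), sqrt(n)*kappa_bar) for one N(0,1) sample."""
    rng = random.Random(seed)
    w = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    std = NormalDist()
    q = std.inv_cdf(p)                       # F^{-1}(p)
    f_q = std.pdf(q)                         # density f at F^{-1}(p)
    k = math.ceil(n * p)
    r_n = sum(1 for x in w if x <= q) / n    # empirical distribution function at q
    lhs = math.sqrt(n) * (w[k - 1] - q)      # scaled, centered sample quantile
    rhs = math.sqrt(n) * (p - r_n) / f_q     # scaled average of the kappa_i
    return lhs, rhs


lhs, rhs = bahadur_gap(n=100_000, p=0.5)
print(lhs, rhs, abs(lhs - rhs))
```

For large $n$ the two quantities should be close, while each of them separately remains of order one; this is the sense in which the sample quantile is replaced by an average of i.i.d. terms.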



We are now in a situation where, for $0 < p < 1$, $\sqrt{n}\,(G_n(p) - G(p))$ has been expressed
as a sum of i.i.d. random variables along with an error term which is $o_p(1)$. That is,

$$\sqrt{n}\,(G_n(p) - G(p)) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} Z_i + o_p(1),$$

where $Z_i = \xi_i + \tau_i - 2\kappa_i$ are i.i.d. random variables. Since $Z_1$ differs from $\theta_p$ only by
the constant $G(p)$, we have $\operatorname{Var}(Z_1) = \operatorname{Var}(\theta_p) = \sigma$. The advantage of this representation
lies in the fact that we are now allowed to examine $G_n$ without having to concern ourselves
with the correlations between its individual components. The representation ensures that
the effect of the correlations is of the same order as the error term or smaller and can hence
be safely disregarded. Consequently, by the central limit theorem for i.i.d. random variables,

$$\sqrt{n}\,(G_n(p) - G(p)) \xrightarrow{d} N(0, \sigma).$$

We can now turn our attention to the functional limit of the process $U_n$. Since we are
interested in the behavior of $G_n$ for $0 < p < 1$, and in particular the point at which it crosses
zero, we restrict ourselves to examining the behavior of $U_n$ on the closed interval $[a,b]$, where
$a$ and $b$ are constants bounded away from $0$ and $1$ respectively. Notice that $U_n$ is a natural
random element of the Skorohod space $D[a,b]$. By virtue
of our representation of $\sqrt{n}\,(G_n(p) - G(p))$, for each $p$, as a sum of i.i.d. random variables
plus an error term of order $o_p(1)$, the central limit theorem for random vectors gives

$$\sqrt{n}\,\bigl(G_n(p_1) - G(p_1), \ldots, G_n(p_k) - G(p_k)\bigr) \xrightarrow{d} N(0, \Sigma),$$

where $k$ is a finite positive integer and, for $i, j = 1, \ldots, k$, $\Sigma = (\sigma_{ij})$ with

$$\sigma_{ij} = \begin{cases} \operatorname{Var}(\theta_{p_i}) & \text{if } i = j, \\ \operatorname{Cov}(\theta_{p_i}, \theta_{p_j}) & \text{if } i \ne j. \end{cases}$$

Now, if we can show that the sequence $U_n$ is tight, we then have the required convergence
to $U$ (see [3] for the necessary arguments). We set about proving tightness in an indirect
way, as opposed to the usual method of showing that $U_n$ concentrates on a compact set in
$D[a,b]$ with high probability. Consider the components of $U_n$:

$$U_n^1 = \sqrt{n}\left(\frac{1}{\lceil np \rceil}\sum_{i=1}^{\lceil np \rceil} W_{(i)} - \frac{1}{p}\int_0^p F^{-1}(q)\,dq\right), \qquad U_n^2 = \sqrt{n}\left(W_{(\lceil np \rceil)} - F^{-1}(p)\right),$$

$$U_n^3 = \sqrt{n}\left(\frac{1}{n - \lceil np \rceil}\sum_{i=\lceil np \rceil + 1}^{n} W_{(i)} - \frac{1}{1-p}\int_p^1 F^{-1}(q)\,dq\right), \qquad U_n^4 = \sqrt{n}\left(W_{(\lceil np \rceil + 1)} - F^{-1}(p)\right).$$

It is interesting that for every $U_n^i$, $i = 1, \ldots, 4$, the functional CLT is an established result.
However, the weak convergence of the individual components $U_n^i$ does not automatically
guarantee weak convergence of the sum of the components. But at this point we need
only tightness. Since the sum of compact sets is again a compact set, it is easy to
show that if each component is tight then the sum is tight with
respect to the Skorohod metric on $D[a,b]$. Now, note that $U_n^2$ and $U_n^4$ are quantile processes
and converge weakly to a Gaussian process (see p. 308, [13]) in $D[a,b]$. Using the result
from [3], we can claim that $U_n^1$ and $U_n^3$ also converge weakly to a limit process in $D[a,b]$.
This proves that $U_n^i$ is relatively compact for each $i$. Now, since $D[a,b]$ is complete
and separable with respect to the Skorohod metric (see p. 115, [3]), using the converse of
Prohorov's theorem (see p. 37, [3]) we can claim that each $U_n^i$, $i = 1, \ldots, 4$, is tight and,
therefore, $U_n = U_n^1 - U_n^2 + U_n^3 - U_n^4$ is tight in $D[a,b]$ equipped with the $J_1$-topology.

□

We now provide verification of our asymptotic results regarding consistency and asymptotic
normality by considering two examples. In both examples we first generate 1000
random variables $T_n = \sqrt{n}\,(G_n(0.5) - G(0.5))$ and obtain the simulated mean and
variance. In order to verify asymptotic normality, we again generate 100 random variables $T_n$.
This is done for different sample sizes $n$ and the results are tabulated.
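A simulation of this kind can be sketched as follows. This is not the authors' code: the function names `ecf_at` and `simulate_t`, the seed, and the reduced number of replications are assumptions chosen to keep the example small, and the data are taken to be standard normal, for which $G(0.5) = 0$.

```python
import math
import random
import statistics


def ecf_at(sample, p):
    """Evaluate G_n(p) with k = ceil(n p)."""
    w = sorted(sample)
    n = len(w)
    k = math.ceil(n * p)
    return sum(w[:k]) / k - w[k - 1] + sum(w[k:]) / (n - k) - w[k]


def simulate_t(n, reps, seed=0):
    """Draw `reps` copies of T_n = sqrt(n) * (G_n(0.5) - G(0.5)) for N(0,1) data."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        draws.append(math.sqrt(n) * ecf_at(sample, 0.5))   # G(0.5) = 0 here
    return draws


draws = simulate_t(n=400, reps=500)
print("simulated mean:", statistics.fmean(draws))
print("simulated variance:", statistics.variance(draws))
```

The simulated mean should be close to zero and the simulated variance close to the limiting value $\sigma$ of Theorem 3.2, up to Monte Carlo error and a finite-sample bias that shrinks as $n$ grows.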

Example 1. If $W_1, W_2, \ldots, W_n$ are i.i.d. $N(0,1)$, then it can be ascertained quite easily
that $G(0.5) = 0$ and $\sigma = 2\pi - 4 \approx 2.2832$. The numbers tabulated below offer satisfactory
evidence about the accuracy of our results.
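The two constants of Example 1 can be worked out from the definition of $\theta_p$ given before Theorem 3.2. The following is a sketch of that calculation, assuming $\theta_p$ consists of the two truncated terms scaled by $1/p$ and $1/(1-p)$ minus twice the Bahadur term; the symbol $\varphi$ for the standard normal density is introduced here for convenience and is not part of the original notation.

```latex
% Example 1: W_1 ~ N(0,1), p = 1/2, so F^{-1}(p) = 0, f(0) = \varphi(0) = 1/\sqrt{2\pi}, \mu_u = -\mu_l.
\begin{align*}
G(0.5) &= \mu_l + \mu_u - 2F^{-1}(0.5) = 0, \\
\theta_{1/2} &= 2W_1 I_{W_1 \le 0} + 2W_1 I_{W_1 > 0} - 2\sqrt{2\pi}\left(\tfrac{1}{2} - I_{W_1 \le 0}\right)
              = 2W_1 + 2\sqrt{2\pi}\, I_{W_1 \le 0} - \sqrt{2\pi}, \\
\operatorname{Cov}\left(W_1, I_{W_1 \le 0}\right) &= E\left(W_1 I_{W_1 \le 0}\right) = -\varphi(0) = -\frac{1}{\sqrt{2\pi}}, \\
\sigma = \operatorname{Var}(\theta_{1/2})
  &= 4\operatorname{Var}(W_1) + 8\pi\operatorname{Var}\left(I_{W_1 \le 0}\right) + 8\sqrt{2\pi}\operatorname{Cov}\left(W_1, I_{W_1 \le 0}\right) \\
  &= 4 + 8\pi\cdot\tfrac{1}{4} - 8 = 2\pi - 4.
\end{align*}
```

The negative covariance between $W_1$ and the indicator is what brings the variance below the sum $4 + 2\pi$ of the two marginal contributions.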
Example 2. In this example, we consider $W_1, W_2, \ldots, W_n$ to be i.i.d. exponential random
variables with mean 1. This represents the archetypal case of a skewed distribution and we
again check the accuracy of our results. In this case, $G(0.5) = 2(1 - \ln 2) \approx 0.6137$ and
$\sigma = 8(1 - \ln 2) \approx 2.4548$. The numbers in the tables below provide further corroborative
evidence for our limiting results.
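The constants quoted in Example 2 can be checked by numerical integration against the Exp(1) density. The sketch below is an illustration under stated assumptions: it uses a plain midpoint rule, truncates the upper tail at a hypothetical cut-off `upper`, and takes $\theta_p$ in the form defined before Theorem 3.2; the function names are hypothetical.

```python
import math


def midpoint(func, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of func over (a, b)."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))


def exp1_crossover_summary(p=0.5, upper=50.0):
    """Return (G(p), Var(theta_p)) for the Exp(1) distribution."""
    q = -math.log(1.0 - p)      # F^{-1}(p) for Exp(1)
    f_q = math.exp(-q)          # density at F^{-1}(p)

    def expect(g):
        # Integrate on (0, q) and (q, upper) separately because g may jump at q.
        def integrand(w):
            return g(w) * math.exp(-w)
        return midpoint(integrand, 0.0, q) + midpoint(integrand, q, upper)

    mu_l = expect(lambda w: w if w <= q else 0.0) / p
    mu_u = expect(lambda w: w if w > q else 0.0) / (1.0 - p)

    def theta(w):
        below = 1.0 if w <= q else 0.0
        return ((w - q) * below / p
                + (w - q) * (1.0 - below) / (1.0 - p)
                - 2.0 * (p - below) / f_q)

    mean = expect(theta)
    variance = expect(lambda w: theta(w) ** 2) - mean ** 2
    return mu_l + mu_u - 2.0 * q, variance


g_half, sigma = exp1_crossover_summary()
print(g_half, 2 * (1 - math.log(2)))
print(sigma, 8 * (1 - math.log(2)))
```

Both printed pairs should agree to several decimal places, which supports the closed-form values $2(1 - \ln 2)$ and $8(1 - \ln 2)$.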



Table 1: Simulated means and variances for different sample sizes.

Random variables     N(0,1)                              Exp(1)
Sample size          n = 100    n = 1000   n = 10000     n = 100    n = 1000   n = 10000
Simulated mean       -0.017     0.018      0.002         -0.041     -0.014     0.0019
Simulated variance   2.407      2.324      2.296         2.491      2.463      2.452



Table 2: p-values for the Kolmogorov-Smirnov test for normality.

Sample size    N(0,1)    Exp(1)
n = 100        0.751     0.8786
n = 1000       0.12      0.2174
n = 10000      0.391     0.9955



4 Concluding Remarks 

Despite being an L-statistic, the ECF has asymptotic properties that cannot be studied using
existing machinery, owing to the fact that its weights are not smooth. Asymptotic results
for heavily trimmed sums are inapplicable to our problem due to the presence of the two
order statistics, $W_{(k)}$ and $W_{(k+1)}$, with unfriendly weights. The centered ECF, however,
can be expressed as a sum of i.i.d. random variables and an error term which goes to
zero at an appropriate rate, by the use of a subtle trick involving truncated sums and the
Bahadur representation for sample quantiles. Owing to this, the CLT follows immediately
and what remains is to show that the centered process satisfies the tightness condition for
the functional CLT.

Note that the ECF is invariant with respect to a shift in the distribution of the $W$'s, but is
linear with respect to scaling. If we introduce a statistic $p_n$ (the empirical split point) that
'solves', in some appropriate sense, the equation

$$G_n(p) = 0,$$

then this statistic is invariant with respect to both shifting and scaling (as it should be,
because the clustering problem is invariant with respect to linear transformations), and
can potentially be used to design a clustering test.
The asymptotics of $p_n$ is a problem on which we are currently working, and limit results
for $G_n$ constitute a very important step toward the solution of this problem. According to
a general plan outlined in Serfling (p. 95), we can conjecture that

$$p_n \approx p_0 - G_n(p_0)/G'(p_0),$$

where $p_0$ is a theoretical split point. Preliminary simulations suggest that this is the
case. However, a rigorous proof of this statement requires significant effort.
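One possible reading of 'solves in some appropriate sense' is to take $p_n = k^*/n$, where $k^*$ is the first index at which the ECF is positive and its value at the next index is not. The sketch below implements this assumed convention; the name `empirical_split_point` and the convention itself are hypothetical and are not taken from the text. It is applied to standard normal data, for which $G(0.5) = 0$.

```python
import random


def empirical_split_point(sample):
    """Return k/n for the first k with G_n > 0 at k and G_n <= 0 at k+1, else None."""
    w = sorted(sample)
    n = len(w)
    total = sum(w)
    lower = 0.0
    previous = None
    for k in range(1, n):
        lower += w[k - 1]                                   # sum of W_(1), ..., W_(k)
        g = lower / k - w[k - 1] + (total - lower) / (n - k) - w[k]
        if previous is not None and previous > 0 >= g:
            return (k - 1) / n                              # sign change between k-1 and k
        previous = g
    return None


rng = random.Random(7)
sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
print(empirical_split_point(sample))
```

Because the rule depends only on the sign pattern of $G_n$, the resulting $p_n$ is unchanged by shifting the data or multiplying them by a positive constant, in line with the invariance noted above.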